Abstract: We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities which were discussed in a weaker form in linguistics but shed some light on redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily.

I. Introduction 

In recent years some interest in the theory of universal coding has focused on detecting hierarchical structure in compressed data. Important tools for this task are universal grammar-based codes [1], which compress strings by first transforming them into special context-free grammars [2] and then encoding the grammars into less redundant strings. This article presents several bounds for the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string. Indirectly, the bounds also concern the code redundancy, which can be elucidated as follows.

Let $X_{m:n} := (X_k)_{m \le k \le n}$ be the blocks of finitely-valued variables $X_i : \Omega \to \mathbb{X} = \{0, 1, ..., D-1\}$ drawn from a stationary process $(X_k)_{k \in \mathbb{Z}}$ on $(\Omega, \mathcal{J}, P)$. Assuming expectation operator $\mathbf{E}$, define the $n$-symbol block entropy $H(n) := H(X_{1:n}) = -\mathbf{E} \log P(X_{1:n})$ and excess entropy $E(n) := I(X_{1:n}; X_{n+1:2n}) = 2H(n) - H(2n)$, being mutual information between adjacent blocks [3].

On the other hand, let $C : \mathbb{X}^+ \to \mathbb{X}^+$ be a uniquely decodable code. For code length $|C(\cdot)|$ being an analog of algorithmic complexity [2], define

$$I^C(u:v) := |C(u)| + |C(v)| - |C(uv)|$$

as the analog of algorithmic mutual information [4]. We will denote the expected normalized code length and its excess as

$$H^C(n) := \mathbf{E}\,|C(X_{1:n})| \log D,$$

$$E^C(n) := \mathbf{E}\, I^C(X_{1:n} : X_{n+1:2n}) \log D.$$
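As a rough illustration of the excess code length, the sketch below evaluates $|C(u)| + |C(v)| - |C(uv)|$ for a stand-in code. Taking `zlib` as the code $C$, measuring lengths in bytes, and the pseudo-random test blocks are assumptions made here for illustration only; the article itself concerns grammar-based codes.

```python
import random
import zlib


def code_length(s: bytes) -> int:
    """Length |C(s)| of the stand-in code C (zlib, measured in bytes)."""
    return len(zlib.compress(s, 9))


def excess_length(u: bytes, v: bytes) -> int:
    """Excess code length |C(u)| + |C(v)| - |C(uv)|."""
    return code_length(u) + code_length(v) - code_length(u + v)


# Hypothetical data: blocks that look incompressible on their own.
rng = random.Random(0)
u = bytes(rng.randrange(256) for _ in range(1000))
v = bytes(rng.randrange(256) for _ in range(1000))

shared = excess_length(u, u)     # the second block repeats the first
unrelated = excess_length(u, v)  # the blocks share no structure
```

For such data the excess should be large when the second block repeats the first and close to zero when the blocks are unrelated, which is the behavior expected of an analog of mutual information.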

For a uniquely decodable code, the noiseless coding inequality $H^C(n) \ge H(n)$ is satisfied, and the code is called universal if the compression rate $\lim_n H^C(n)/n$ equals the entropy rate $h := \lim_n H(n)/n$ for any stationary distribution $P((X_k)_{k \in \mathbb{Z}} \in \cdot)$. In fact, the search for codes having the lowest redundancy on finite strings can be restated as the task of finding universal codes with the smallest excess code length $I^C(\cdot:\cdot)$ since

$$\limsup_{n \to \infty}\, [E^C(n) - E(n)] \ge 0, \tag{1}$$

$$\limsup_{n \to \infty}\, [E^C(n) - E^{C'}(n)] \ge 0 \quad \text{if } H^C(\cdot) \ge H^{C'}(\cdot), \tag{2}$$

for any universal codes $C$ and $C'$, cf. [5], [6].

The specific aim of the present note is to justify links between the vocabulary size and the excess code length $I^C(\cdot:\cdot)$ for certain universal grammar-based codes. A weaker form of this connection was mentioned in the context of the following linguistic investigations, cf. [5], [7]:

(i) The majority of words in a natural language text can be identified as frequently repeated strings of letters. Grammar-based codes can be used to detect these repeats. Distinct words of the text happen to be represented as distinct nonterminal symbols in an approximately smallest context-free grammar for the text [8], [9]. The number of different "significantly" often repeated substrings in a typical text can be 100 times greater than in a comparable realization of a memoryless source [7].

(ii) There is a hypothesis that the excess entropy of a random natural language text (imagined as a stationary stochastic process with $X_i$ being consecutive letters of the text) obeys $E(n) \asymp \sqrt{n}$ rather than $E(n) = 0$ as for a memoryless source [10] (cf. [6] for a connection of such an effect with nonergodicity). We asked whether the power-law growth of $E(n)$ can be linked with the known empirical power-law growth of the number of distinct words in a text against the text length [11].

In view of observation (i), our question in (ii) could be restated as: Are excess entropy $E(n)$ and the expected vocabulary size of some minimal code for string $X_{1:2n}$ approximately equal for every stationary process? Trying to answer the question, we derived inequality (1) in [5] and sought further links between the excess code length and the vocabulary size. The result of [5] concerning the latter is encouraging but too weak. It relates the vocabulary size of the smallest grammar in the sense of [2] to the Yang-Kieffer excess grammar length rather than to the excess length of an actual universal code.

In this article, we will strengthen the connection. We will prove that the excess code length $I^C(u:v)$ for some grammar-based code $C$ is dominated by the product of the length of the longest repeated substring in string $w := uv$ and the vocabulary size of the code for $w$. To get this inequality, it suffices that $C$ be the shortest code in an algebraically closed subclass of codes using a special grammar-to-string encoder. There exist universal codes satisfying this requirement.

Besides the mentioned dominance, we will justify an inequality in the opposite direction and, additionally, show that the vocabulary size of an irreducible grammar for string $w$ cannot be less than the square root of the grammar length, cf. [7], [1]. This pair of inequalities might be used to lower-bound the redundancy of codes based on irreducible grammars.

The exposition is as follows. Section II reviews grammar-based coding. We construct local grammar-to-string encoders (II-A) and define minimal codes (II-B) with respect to some classes of grammars (II-C). Subsection II-D justifies universality of certain minimal codes which use local encoders. Section III presents the upper (III-A) and the lower (III-B) bounds for the excess lengths of a minimal code expressed in terms of its vocabulary size. Section IV concludes the article.

II. Grammar-based coding revisited 

Grammar-based compression is founded on the following concept. An admissible grammar is a context-free grammar which generates singleton language $\{w\}$, $w \in \mathbb{X}^+$, and whose production rules do not have empty right-hand sides [1]. In such a grammar, there is one rule per nonterminal symbol and the nonterminals can be ordered so that the symbols are rewritten onto strings of strictly succeeding symbols [1].

Hence, an admissible grammar is given by its set of production rules $\{A_1 \to \alpha_1, A_2 \to \alpha_2, ..., A_n \to \alpha_n\}$, where $A_1$ is the start symbol, other $A_i$ are secondary nonterminals, and the right-hand sides of rules satisfy $\alpha_i \in (\{A_{i+1}, A_{i+2}, ..., A_n\} \cup \mathbb{X})^+$. Since the grammar can be restored also from sequence

$$G = (\alpha_1, \alpha_2, ..., \alpha_n), \tag{3}$$

we will call $G$ simply the grammar. Its vocabulary size, i.e., the number of used nonterminal symbols, will be written

$$\mathbf{V}[G] := \operatorname{card}\{A_1, A_2, ..., A_n\} = n.$$

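Let $\mathbb{X}^* = \mathbb{X}^+ \cup \{\lambda\}$, where $\lambda$ is the empty word. For any string $\alpha \in (\{A_2, A_3, ..., A_n\} \cup \mathbb{X})^*$, we denote its expansion with respect to $G = (\alpha_1, \alpha_2, ..., \alpha_n)$ as $\langle\alpha\rangle_G$ [2], i.e., $\{\langle\alpha\rangle_G\}$ is the language generated by grammar $(\alpha, \alpha_2, \alpha_3, ..., \alpha_n)$. The set of admissible grammars will be denoted as $\mathcal{G}$ and $\mathcal{G}(w)$ will be the subset of admissible grammars which generate language $\{w\}$, $w \in \mathbb{X}^+$. Function $\Gamma : \mathbb{X}^+ \to \mathcal{G}$ such that $\Gamma(w) \in \mathcal{G}(w)$ for all $w \in \mathbb{X}^+$ is called a grammar transform [1].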

If string $w$ contains many repeated substrings then some grammar in $\mathcal{G}(w)$ can "factor out" the repetitions and may be used to represent $w$ concisely. It is not straightforward, however, how to quantify the size of a grammar. In [1] the length of grammar $G = (\alpha_1, \alpha_2, ..., \alpha_{\mathbf{V}[G]})$ was defined as

$$|G| := \sum_{i} |\alpha_i|, \tag{4}$$

where $|\alpha|$ is the length of $\alpha \in (\{A_1, A_2, ..., A_n\} \cup \mathbb{X})^*$. Function $|\cdot|$ will be called Yang-Kieffer length.
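The notions of expansion, vocabulary size, and Yang-Kieffer length can be made concrete with a minimal sketch. The list-based representation (an integer `j` for nonterminal $A_j$, a one-character string for a terminal) and the function names are assumptions chosen for illustration, not part of the original construction.

```python
from typing import List, Union

Symbol = Union[int, str]      # int j stands for nonterminal A_j, str for a terminal
Grammar = List[List[Symbol]]  # G = (alpha_1, ..., alpha_n); rule i is G[i - 1]


def expand(alpha: List[Symbol], G: Grammar) -> str:
    """Expansion of alpha with respect to G: rewrite until only terminals remain."""
    out = []
    for sym in alpha:
        if isinstance(sym, int):             # nonterminal A_sym
            out.append(expand(G[sym - 1], G))
        else:                                # terminal symbol
            out.append(sym)
    return "".join(out)


def vocabulary_size(G: Grammar) -> int:
    """V[G]: the number of nonterminal symbols, i.e. the number of rules."""
    return len(G)


def yang_kieffer_length(G: Grammar) -> int:
    """|G|: the sum of the lengths of all right-hand sides, as in (4)."""
    return sum(len(alpha) for alpha in G)


# Hypothetical example: A1 -> A2 A2 c, A2 -> a A3, A3 -> b b.
G = [[2, 2, "c"], ["a", 3], ["b", "b"]]
w = expand(G[0], G)
```

The recursion terminates because in an admissible grammar every rule refers only to nonterminals with strictly larger indices.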



For a grammar transform, ratio $|\Gamma(w)| / |w|$ can be quite a biased measure of string compressibility. Precisely, transform $\Gamma$ is called asymptotically compact if

$$\lim_{n \to \infty} \max_{w \in \mathbb{X}^n} |\Gamma(w)| / n = 0 \tag{5}$$

and for each grammar in $\Gamma(\mathbb{X}^+)$ each nonterminal has a different expansion. There are plenty of such transforms [1], [2].

Since the compression given by (5) is apparent, consider grammar-based codes, i.e., uniquely decodable codes $C = B(\Gamma(\cdot)) : \mathbb{X}^+ \to \mathbb{X}^+$, where $\Gamma : \mathbb{X}^+ \to \mathcal{G}$ is a grammar transform and $B : \mathcal{G} \to \mathbb{X}^+$ is called a grammar encoder [1]. We have $\lim_n \max_{w \in \mathbb{X}^n} |C(w)| / n \ge 1$ necessarily. Nevertheless, there exists a grammar encoder $B_{YK} : \mathcal{G} \to \mathbb{X}^+$ [1] such that

(i) set $B_{YK}(\mathcal{G})$ is prefix-free,

(ii) $|B_{YK}(G)| \le |G| (A + \log_D |G|)$ for some $A > 0$,

(iii) $C = B_{YK}(\Gamma(\cdot))$ is a universal code for any asymptotically compact transform $\Gamma$.

A. Local grammar encoders 

It is hard to analyze the excess lengths of grammar-based codes which use $B_{YK}$ given by [1] as their grammar-to-string encoder. We will define a more convenient encoder. It will represent a grammar as a string resembling list (3) but, simultaneously, it will constitute nearly a homomorphism between some operations on grammars and strings.

Definition 1: $\oplus : \mathcal{G} \times \mathcal{G} \to \mathcal{G}$ is called grammar joining if

$$G_1 \in \mathcal{G}(w_1) \wedge G_2 \in \mathcal{G}(w_2) \implies G_1 \oplus G_2 \in \mathcal{G}(w_1 w_2).$$

It would be convenient to use such grammar joining $\oplus$ and encoder $B : \mathcal{G} \to \mathbb{X}^+$ that the edit distance between $B(G_1 \oplus G_2)$ and $B(G_1)B(G_2)$ be small. Without making the idea too precise, such joining and encoder will be called adapted.

The following example of mutually adapted joining and encoders will be used in the next sections. For any function $f : U \to W$ of symbols, where concatenation on domains $U^*$ and $W^*$ is defined, denote its extension onto strings as $f^* : U^* \ni x_1 x_2 ... x_m \mapsto f(x_1) f(x_2) ... f(x_m) \in W^*$. For grammars $G_i = (\alpha_{i1}, \alpha_{i2}, ..., \alpha_{in_i})$, $i = 1, 2$, define joining

$$G_1 \oplus G_2 := (A_2 A_{n_1+2}, H_1^*(\alpha_{11}), H_1^*(\alpha_{12}), ..., H_1^*(\alpha_{1n_1}), H_2^*(\alpha_{21}), H_2^*(\alpha_{22}), ..., H_2^*(\alpha_{2n_2})),$$

where $H_1(A_j) := A_{j+1}$ and $H_2(A_j) := A_{j+n_1+1}$ for nonterminals and $H_1(x) := H_2(x) := x$ for terminals $x \in \mathbb{X}$.
Definition 2: $B : \mathcal{G} \to \mathbb{X}^+$ is a local grammar encoder if

$$B(G) = B_S^*(B_N(G)), \tag{6}$$

where:

(i) function $B_N : \mathcal{G} \to (\{0\} \cup \mathbb{N})^*$ encodes grammars as strings of natural numbers so that the encoding of grammar $G = (\alpha_1, \alpha_2, ..., \alpha_n)$ is string

$$B_N(G) := F_1^*(\alpha_1)\, D\, F_2^*(\alpha_2)\, D \ldots D\, F_n^*(\alpha_n)\, (D+1),$$

which employs relative indexing $F_i(A_j) := D + 1 + j - i$ for nonterminals and identity transformation $F_i(x) := x$ for terminals $x \in \mathbb{X} = \{0, 1, ..., D-1\}$,

(ii) $B_S$ is any function of form $B_S : \{0\} \cup \mathbb{N} \to \mathbb{X}^+$ (for technical purposes, not necessarily an injection); we will call $B_S$ the natural number encoder.
Indeed, local encoders are adapted to joining operation $\oplus$. For instance, if $B(G_i) = u_i B_S(D+1)$ for some grammars $G_i$, $i = 1, 2$, then $B(G_1 \oplus G_2) = B_S(D+2)\, B_S(D+2+\mathbf{V}[G_1])\, B_S(D)\, u_1\, B_S(D)\, u_2\, B_S(D+1)$.
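The identity above can be checked at the level of number strings, before the natural number encoder is applied. The following sketch implements the joining and the relative-index encoding; the tuple representation `("A", j)` of nonterminals, the function names, and the two binary example grammars are assumptions introduced for illustration.

```python
from typing import List, Tuple, Union

Nonterminal = Tuple[str, int]        # ("A", j) stands for nonterminal A_j
Symbol = Union[int, Nonterminal]     # terminals are integers 0, ..., D - 1
Grammar = List[List[Symbol]]         # G = (alpha_1, ..., alpha_n)


def shift(alpha: List[Symbol], offset: int) -> List[Symbol]:
    """Raise every nonterminal index by offset and keep terminals unchanged."""
    return [("A", s[1] + offset) if isinstance(s, tuple) else s for s in alpha]


def join(G1: Grammar, G2: Grammar) -> Grammar:
    """Grammar joining with the new start rule A_2 A_{n1+2}."""
    n1 = len(G1)
    start: List[Symbol] = [("A", 2), ("A", n1 + 2)]
    return ([start]
            + [shift(alpha, 1) for alpha in G1]
            + [shift(alpha, n1 + 1) for alpha in G2])


def encode_numbers(G: Grammar, D: int) -> List[int]:
    """Number string of G: relative indices D+1+j-i, separator D, terminator D+1."""
    out: List[int] = []
    for i, alpha in enumerate(G, start=1):
        if i > 1:
            out.append(D)                                  # rule separator
        for s in alpha:
            out.append(D + 1 + s[1] - i if isinstance(s, tuple) else s)
    out.append(D + 1)                                      # end of grammar
    return out


# Hypothetical binary example (D = 2).
D = 2
G1 = [[("A", 2), ("A", 2), 1], [0, 1]]   # generates 01011
G2 = [[("A", 2), 0, ("A", 2)], [1, 1]]   # generates 11011
joined = encode_numbers(join(G1, G2), D)
# Number-level counterpart of the identity stated in the text.
expected = ([D + 2, D + 2 + len(G1), D]
            + encode_numbers(G1, D)[:-1] + [D]
            + encode_numbers(G2, D)[:-1] + [D + 1])
```

Because the indexing is relative, shifting all rules of a grammar by the same offset leaves their number strings unchanged, which is why the encodings of the two grammars reappear verbatim inside the encoding of the joined grammar.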

There exist many prefix-free local encoders. Obviously, set $B_N(\mathcal{G})$ itself is prefix-free. Therefore, encoder (6) is prefix-free (and uniquely decodable) if $B_S$ is also prefix-free, i.e., if $B_S$ is an injection and set $B_S(\{0\} \cup \mathbb{N})$ is prefix-free.

B. Encoder-induced grammar lengths 

Let us generalize the concept of grammar length. 

Definition 3: For a grammar encoder $B$, function $|B(\cdot)|$ will be called the $B$-induced grammar length.
For example, Yang-Kieffer length $|\cdot|$ is $B$-induced for a local grammar encoder $B = B_S^*(B_N(\cdot))$, where

$$B_S(x) = \lambda \text{ for } x \in \{D, D+1\} \text{ and } B_S(x) \in \mathbb{X} \text{ else.} \tag{7}$$

In the same spirit, we can extend the idea of the smallest grammar with respect to the Yang-Kieffer length, discussed in [2]. Subclass $\mathcal{J} \subset \mathcal{G}$ of admissible grammars will be called sufficient if there exists a grammar transform $\Gamma : \mathbb{X}^+ \to \mathcal{J}$, i.e., if $\mathcal{J} \cap \mathcal{G}(w) \neq \emptyset$ for all $w \in \mathbb{X}^+$. Conversely, we will call grammar transform $\Gamma$ a $\mathcal{J}$-grammar transform if $\Gamma(\mathbb{X}^+) \subset \mathcal{J}$.

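Definition 4: For grammar length $\|\cdot\|$, $\mathcal{J}$-grammar transform $\Gamma$ will be called $(\|\cdot\|, \mathcal{J})$-minimal grammar transform if $\|\Gamma(w)\| \le \|G\|$ for all $G \in \mathcal{G}(w) \cap \mathcal{J}$ and $w \in \mathbb{X}^+$.

Definition 5: Code $B(\Gamma(\cdot))$ will be called $(B, \mathcal{J})$-minimal if $\Gamma$ is $(\|\cdot\|, \mathcal{J})$-minimal for a $B$-induced grammar length $\|\cdot\|$.

Definition 6: For a grammar length $\|\cdot\|$, grammar subclasses $\mathcal{J}, \mathcal{K} \subset \mathcal{G}$ are called $\|\cdot\|$-equivalent if

$$\min_{G \in \mathcal{G}(w) \cap \mathcal{J}} \|G\| = \min_{G \in \mathcal{G}(w) \cap \mathcal{K}} \|G\| \quad \text{for all } w \in \mathbb{X}^+.$$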

C. Subclasses of grammars 

In Section III we will bound the excess lengths for $(B, \mathcal{J})$-minimal codes, where $B$ are local encoders and $\mathcal{J}$ are some sufficient subclasses. In Subsection II-D we will show that several of these codes are universal. Prior to this, we have to define some necessary subclasses of grammars.

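First, we will say that $(\alpha_1, \alpha_2, ..., \alpha_n)$ is a flat grammar if $\alpha_i \in \mathbb{X}^+$ for $i > 1$. The set of flat grammars will be denoted as $\mathcal{F}$. Symbol $\mathcal{D}_k \subset \mathcal{F}$ will denote the class of $k$-block interleaved grammars, i.e., flat grammars $(\alpha_1, \alpha_2, ..., \alpha_n)$, where $\alpha_i \in \mathbb{X}^k$ for $i > 1$. On the other hand, $\mathcal{B}_k \subset \mathcal{D}_k$ will stand for the set of $k$-block grammars, i.e., $k$-block interleaved grammars $(uw, \alpha_2, ..., \alpha_n)$, where string $u \in (\{A_2, A_3, ..., A_n\})^*$ contains occurrences of all $A_2, A_3, ..., A_n$ and string $w \in \mathbb{X}^*$ has length $|w| < k$, cf. [12]. Of course, classes $\mathcal{B}_k$, $\mathcal{D}_k$, $\mathcal{B} := \bigcup_{k \ge 1} \mathcal{B}_k$, $\mathcal{D} := \bigcup_{k \ge 1} \mathcal{D}_k$, and $\mathcal{F}$ are sufficient.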

Next, grammar $(\alpha_1, \alpha_2, ..., \alpha_n)$ is called irreducible if

(i) each string $\alpha_i$ has a different expansion $\langle\alpha_i\rangle_G$ and satisfies $|\alpha_i| > 1$,

(ii) each secondary nonterminal appears in string $\alpha_1 \alpha_2 ... \alpha_n$ at least twice,

(iii) each pair of consecutive symbols in strings $\alpha_1, \alpha_2, ..., \alpha_n$ appears at most once at nonoverlapping positions [1].

The set of irreducible grammars will be denoted as $\mathcal{I}$. Any $\mathcal{I}$-grammar transform is asymptotically compact [1] so it yields a universal code when combined with grammar encoder $B_{YK}$. Starting with any grammar $G_1 \in \mathcal{G}(w)$, one can construct an irreducible grammar $G_2 \in \mathcal{G}(w)$ by applying a sequence of certain reduction rules until the local minimum of functional $2|\cdot| - \mathbf{V}[\cdot]$ is achieved [1]. This leads to the following lemma.

Lemma 1: Classes $\mathcal{I}$ and $\mathcal{G}$ are $|\cdot|$-equivalent.

Proof: The only reduction rule applicable to a grammar minimizing $|\cdot|$ is the introduction of a new nonterminal denoting a pair of symbols which appears exactly twice on the right-hand side of the grammar, cf. section VI in [1]. This reduction conserves Yang-Kieffer length. ■

Additionally, we will say that grammar $(\alpha_1, \alpha_2, ..., \alpha_n)$ is partially irreducible if it satisfies conditions (i) and (ii) of irreducibility and, moreover, each pair of consecutive symbols in string $\alpha_1$ appears at most once at nonoverlapping positions. Let $\mathcal{P}$ stand for the set of partially irreducible grammars. Of course, $\mathcal{I} \subset \mathcal{P} \subset \mathcal{G}$ and $\mathcal{P}$ is sufficient.

Although $\mathcal{F} \cap \mathcal{P}$ and $\mathcal{F}$ are not $|\cdot|$-equivalent, class $\mathcal{F} \cap \mathcal{P}$ is sufficient and relates to $\mathcal{F}$ partially like $\mathcal{I}$ relates to $\mathcal{G}$. Some $\mathcal{F} \cap \mathcal{P}$-grammar transform $\Gamma$ is a modification of the longest matching $\mathcal{I}$-grammar transform [1], [2]. In order to compute $\Gamma(w)$, we start with grammar $\{A_1 \to w\}$ and we replace iteratively the longest repeated substrings $u$ in the start symbol definition with new nonterminals $A_i \to u$ until there is no repeat of length $|u| \ge 2$. $\Gamma(w)$ is the modified grammar.
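The procedure just described can be sketched as follows. This is only an illustration under stated assumptions: repeats are taken to be nonoverlapping, all occurrences of the chosen substring are replaced greedily from left to right, ties between equally long repeats are broken by the earliest occurrence, and the representation (integer `j` for $A_j$, characters for terminals) is hypothetical.

```python
from typing import List, Optional, Tuple, Union

Symbol = Union[int, str]   # int j stands for nonterminal A_j, str for a terminal


def longest_repeat(seq: List[Symbol]) -> Optional[Tuple[int, int]]:
    """Return (start, length) of a longest substring of length >= 2 that
    occurs at least twice at nonoverlapping positions, or None."""
    n = len(seq)
    for length in range(n // 2, 1, -1):
        seen = {}
        for i in range(n - length + 1):
            key = tuple(seq[i:i + length])
            first = seen.setdefault(key, i)
            if i >= first + length:          # second, nonoverlapping occurrence
                return first, length
    return None


def longest_matching_flat(w: str) -> List[List[Symbol]]:
    """Sketch of the modified longest matching transform described above."""
    start: List[Symbol] = list(w)
    rules: List[List[Symbol]] = []           # right-hand sides of A_2, A_3, ...
    while (found := longest_repeat(start)) is not None:
        pos, length = found
        u = start[pos:pos + length]
        new_nt = len(rules) + 2              # index of the new nonterminal
        rules.append(u)
        out: List[Symbol] = []
        i = 0
        while i < len(start):                # replace occurrences left to right
            if start[i:i + length] == u:
                out.append(new_nt)
                i += length
            else:
                out.append(start[i])
                i += 1
        start = out
    return [start] + rules
```

For instance, `longest_matching_flat("abcabcabc")` returns `[[2, 2, 2], ['a', 'b', 'c']]`, i.e., the grammar $A_1 \to A_2 A_2 A_2$, $A_2 \to abc$.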



D. Universal codes for local encoders 

Neuhoff and Shields proved that any $(B_{NS}, \mathcal{B})$-minimal code is universal for some encoder $B_{NS}$ and the class of block grammars $\mathcal{B}$ [12]. Encoder $B_{NS}$ resembles a local encoder. The main difference is that $B_{NS}$ encodes nonterminals as strings of length $\lfloor \log_D \mathbf{V}[G] \rfloor + 1$, whereas a local encoder encodes nonterminal $A_j$ in the $i$-th rule as a string of length $|B_S(D + 1 + j - i)|$. Therefore we can establish the following proposition.

Theorem 1: Let $B_S$ be such a prefix-free natural number encoder that $|B_S(\cdot)|$ is growing and

$$\limsup_{n \to \infty} |B_S(n)| / \log_D n = 1. \tag{8}$$

Then for any sufficient subclass of grammars $\mathcal{J} \supset \mathcal{B}$, every $(B_S^*(B_N(\cdot)), \mathcal{J})$-minimal code $C$ is universal, that is, $\lim_n H^C(n)/n = h$ and $\limsup_n |C(X_{1:n})|/n \le h$ almost surely for every stationary process $(X_k)_{k \in \mathbb{Z}}$.

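Proof: Consider $\mathcal{B}_k$-grammar transforms $\Gamma_k$. For $\varepsilon > 0$ and stationary process $(X_k)_{k \in \mathbb{Z}}$ with entropy rate $h$, let $k(n)$ be the largest integer $k$ satisfying $k\, 2^{k(h+\varepsilon)} \le n$. We have

$$\limsup_{n \to \infty} \max_{w \in \mathbb{X}^n} \frac{\log \mathbf{V}[\Gamma_{k(n)}(w)]}{k(n)} \le h + 2\varepsilon,$$

$$\lim_{n \to \infty} \mathbf{E}\, \mathbf{V}[\Gamma_{k(n)}(X_{1:n})] \cdot k(n)/n = 0,$$

$$\lim_{n \to \infty} \mathbf{V}[\Gamma_{k(n)}(X_{1:n})] \cdot k(n)/n = 0 \text{ almost surely, cf. [12].}$$

Since $\lim_n k(n) = \infty$, a $(B, \mathcal{J})$-minimal code is universal if

$$|B(\Gamma_k(w))| \le \alpha k \mathbf{V}[\Gamma_k(w)] + \gamma(k) \frac{n}{k} \log_D \mathbf{V}[\Gamma_k(w)],$$

where $\alpha > 0$ and $\lim_k \gamma(k) = 1$. In particular, this inequality holds for (6), (8), and growing $|B_S(\cdot)|$. ■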
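The prefix-free natural number encoder $B_S$ satisfying (8) can be chosen, e.g., as the $D$-ary representation $\omega : \mathbb{N} \to \mathbb{X}^*$ [13], $|\omega(n)| = \ell(n)$, where

$$\ell(n) := \begin{cases} 1 & \text{if } n < D, \\ \ell(\lfloor \log_D n \rfloor) + \lfloor \log_D n \rfloor + 1 & \text{if } n \ge D. \end{cases}$$

Alternatively, we can use the $D$-ary representation $\delta : \mathbb{N} \to \mathbb{X}^*$ [13], $|\delta(n)| = 1 + 2 \lfloor \log_D (1 + \lfloor \log_D n \rfloor) \rfloor + \lfloor \log_D n \rfloor$.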
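The two length formulas are easy to evaluate numerically. The sketch below computes only the code lengths, not the codewords; the function names are hypothetical and the input is assumed to satisfy $n \ge 1$ and $D \ge 2$.

```python
def floor_log(n: int, D: int) -> int:
    """Integer part of log_D(n) for n >= 1, computed without floating point."""
    k = 0
    while n >= D:
        n //= D
        k += 1
    return k


def omega_length(n: int, D: int) -> int:
    """Length of the recursive D-ary representation omega(n)."""
    if n < D:
        return 1
    k = floor_log(n, D)
    return omega_length(k, D) + k + 1


def delta_length(n: int, D: int) -> int:
    """Length 1 + 2*floor(log_D(1 + floor(log_D n))) + floor(log_D n) of delta(n)."""
    k = floor_log(n, D)
    return 1 + 2 * floor_log(1 + k, D) + k
```

For $D = 2$ these values coincide with the lengths of the binary Elias omega and delta codewords, e.g., `omega_length(4, 2)` is 6 and `delta_length(4, 2)` is 5.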

III. Bounds involving the vocabulary size 

We will derive several inequalities for the vocabulary size 
of certain minimal grammar-based codes. Frankly speaking, 
code universality is irrelevant for the proofs. It is important, 
however, that the codes use the local grammar encoders. 

A. Upper bounds for the excess lengths 

We will begin with defining several operations on grammars. For strings $u, v \in \mathbb{X}^*$ with $n = |u|$, $m = |v|$, and $w = uv$, define the left and right croppings of grammar $G = (\alpha_1, \alpha_2, ..., \alpha_n) \in \mathcal{G}(w)$ as

$$\mathbf{L}_n G := (x_L y_L, \alpha_2, ..., \alpha_n) \in \mathcal{G}(u),$$
$$\mathbf{R}_m G := (y_R x_R, \alpha_2, ..., \alpha_n) \in \mathcal{G}(v),$$

where exactly one of the following conditions holds:

(i) $\alpha_1 = x_L x_R$ and $y_L y_R = \lambda$,

(ii) $\alpha_1 = x_L A_i x_R$ for some nonterminal $A_i$, $2 \le i \le n$, with expansion $\langle A_i \rangle_G = y_L y_R$.

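Next, for $G = (\alpha_1, ..., \alpha_n)$, define its flattening $\mathbf{F}G := (\alpha_1, \langle\alpha_2\rangle_G, \langle\alpha_3\rangle_G, ..., \langle\alpha_n\rangle_G)$. The secondary part of the grammar will be denoted as $\mathbf{S}G := (\lambda, \alpha_2, \alpha_3, ..., \alpha_n)$. Additionally, we will use a notation for the maximal length of a nonoverlapping repeat in string $w \in \mathbb{X}^*$, i.e.,

$$\mathbf{L}(w) := \max_{u, x, y, z \in \mathbb{X}^* :\ w = xuyuz} |u|.$$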
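The maximal length of a nonoverlapping repeat can be computed directly from its definition. The following brute-force sketch is meant only to fix the meaning of the quantity; the function name is hypothetical and no attempt is made at efficiency.

```python
def max_repeat_length(w: str) -> int:
    """Largest |u| such that w = x u y u z for some strings x, y, z."""
    n = len(w)
    for length in range(n // 2, 0, -1):
        first_seen = {}
        for i in range(n - length + 1):
            first = first_seen.setdefault(w[i:i + length], i)
            if i >= first + length:     # two occurrences that do not overlap
                return length
    return 0                            # only the empty string u is repeated
```

For example, `max_repeat_length("abcabc")` is 3, whereas `max_repeat_length("aaa")` is 1 because the two occurrences of `"aa"` overlap.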

Now we can generalize Theorem 3 from [5]. We will show that the lengths of some minimal codes are almost subadditive. Moreover, the excess lengths are dominated by the vocabulary size multiplied by the length of the longest repeat.

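Theorem 2: Let $B$ be local encoder (6). Introduce constants

$$W_m := \max_{0 \le n \le D+2+m} |B_S(n)|.$$

Let $\Gamma$ be a $(\|\cdot\|, \mathcal{J})$-minimal grammar transform for the $B$-induced grammar length $\|\cdot\|$. Consider code $C = B(\Gamma(\cdot))$, strings $u, v, w \in \mathbb{X}^+$, and a grammar class $\mathcal{K}$ which is $\|\cdot\|$-equivalent to $\mathcal{J}$.

(i) If $G_1, G_2 \in \mathcal{J} \implies G_1 \oplus G_2 \in \mathcal{K}$ then

$$|C(u)| + |C(v)| - |C(uv)| \ge -3W_0 - W_{\mathbf{V}[\Gamma(u)]}. \tag{9}$$

(ii) If $G \in \mathcal{J} \implies \mathbf{L}_n G, \mathbf{R}_n G \in \mathcal{K}$ for all valid $n$ then

$$|C(u)|, |C(v)| \le |C(uv)| + W_0 \mathbf{L}(uv), \tag{10}$$
$$|C(u)| + |C(v)| - |C(uv)| \le \|\mathbf{S}\Gamma(uv)\| + W_0 \mathbf{L}(uv). \tag{11}$$

(iii) If $G \in \mathcal{J} \implies \mathbf{F}G \in \mathcal{K}$ then

$$\|\mathbf{S}\Gamma(w)\| + W_0 \mathbf{L}(w) \le W_0 \mathbf{V}[\Gamma(w)] (1 + \mathbf{L}(w)). \tag{12}$$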

Remark 1: In particular, (9) holds for $\mathcal{J} = \mathcal{G}, \mathcal{P}, \mathcal{I}$ while inequalities (10)-(12) hold for $\mathcal{J} = \mathcal{G}, \mathcal{P}, \mathcal{I}, \mathcal{D}, \mathcal{D}_k$. Moreover, (11) and (12) imply together bound

$$|C(u)| + |C(v)| - |C(uv)| \le W_0 \mathbf{V}[\Gamma(uv)] (1 + \mathbf{L}(uv)), \tag{13}$$

which we have mentioned in the introduction.
Remark 2: Theorem 3 in [5] is a restriction of Theorem 2 to $B_S$ given by (7) and $\|\cdot\|$ equal to Yang-Kieffer length $|\cdot|$.
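Proof:

(i) The result is implied by $\|\Gamma(w)\| \le \|\Gamma(u) \oplus \Gamma(v)\|$ for $w = uv$ and

$$\|G_1 \oplus G_2\| \le \|G_1\| + \|G_2\| + |B_S(D + 2 + \mathbf{V}[G_1])| + 3W_0,$$

where $G_1 = \Gamma(u)$ and $G_2 = \Gamma(v)$.

(ii) Set $n = |u|$, $m = |v|$, and $w = uv$. The inequalities follow from

$$\|\Gamma(w)\| + W_0 \mathbf{L}(w) \ge \|\mathbf{L}_n \Gamma(w)\| \ge \|\Gamma(u)\|,$$
$$\|\Gamma(w)\| + W_0 \mathbf{L}(w) \ge \|\mathbf{R}_m \Gamma(w)\| \ge \|\Gamma(v)\|,$$

and

$$\|\mathbf{L}_n \Gamma(w)\| + \|\mathbf{R}_m \Gamma(w)\| \le \|\Gamma(w)\| + \|\mathbf{S}\Gamma(w)\| + W_0 \mathbf{L}(w).$$

(iii) The thesis is entailed by $\|\mathbf{S}\Gamma(w)\| \le \|\mathbf{S}\mathbf{F}\Gamma(w)\|$ and

$$\|\mathbf{S}\mathbf{F}\Gamma(w)\| \le W_0 (\mathbf{V}[\Gamma(w)] - 1)(1 + \mathbf{L}(w)) + W_0.$$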



B. Lower bounds for the excess lengths 

For Yang-Kieffer length function, the excess lengths can be lower-bounded by another quantity related to vocabulary size. Firstly, for grammars $G_i = (\alpha_{i1}, \alpha_{i2}, ..., \alpha_{in_i})$, $i = 1, 2$, denote the number of their common nonterminal expansions

$$\mathbf{V}[G_1; G_2] := \operatorname{card} \bigcap_{i=1,2} \{ \langle\alpha_{i2}\rangle_{G_i}, \langle\alpha_{i3}\rangle_{G_i}, ..., \langle\alpha_{in_i}\rangle_{G_i} \}$$

and introduce a new kind of grammar joining

$$G_1 \otimes G_2 := (\alpha_{11}\alpha_{21}, Q_1^*(\alpha_{12}), ..., Q_1^*(\alpha_{1n_1}), Q_2^*(\alpha_{22}), ..., Q_2^*(\alpha_{2n_2})),$$

where $Q_1(A_j) := A_j$ and $Q_2(A_j) := A_{j+n_1-1}$ for nonterminals and $Q_1(x) := Q_2(x) := x$ for terminals $x \in \mathbb{X}$.
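A small sketch may help to see how this joining and the count of common expansions behave. The representation (integer `j` for $A_j$, characters for terminals) and the function names are hypothetical. As a labeled assumption, the re-indexing $Q_2$ is also applied to the start rule of the second grammar, so that its nonterminals keep pointing at the shifted rules.

```python
from typing import List, Union

Symbol = Union[int, str]      # int j stands for nonterminal A_j, str for a terminal
Grammar = List[List[Symbol]]  # G = (alpha_1, ..., alpha_n); rule i is G[i - 1]


def expand(alpha: List[Symbol], G: Grammar) -> str:
    """Expansion of alpha with respect to G."""
    return "".join(expand(G[s - 1], G) if isinstance(s, int) else s
                   for s in alpha)


def common_vocabulary(G1: Grammar, G2: Grammar) -> int:
    """Number of expansions shared by the secondary rules of G1 and G2."""
    exp1 = {expand(alpha, G1) for alpha in G1[1:]}
    exp2 = {expand(alpha, G2) for alpha in G2[1:]}
    return len(exp1 & exp2)


def tensor_join(G1: Grammar, G2: Grammar) -> Grammar:
    """Joining that concatenates the start rules and keeps all secondary rules."""
    n1 = len(G1)

    def q2(alpha: List[Symbol]) -> List[Symbol]:
        # Q_2 moves A_j of G2 to A_{j + n1 - 1}; terminals are unchanged.
        return [s + n1 - 1 if isinstance(s, int) else s for s in alpha]

    # Assumption: Q_2 is applied to the start rule of G2 as well.
    return [G1[0] + q2(G2[0])] + G1[1:] + [q2(alpha) for alpha in G2[1:]]


# Hypothetical example: both grammars contain a rule expanding to "ab".
G1 = [[2, 2, "c"], ["a", "b"]]   # generates ababc
G2 = [[2, "d", 2], ["a", "b"]]   # generates abdab
G = tensor_join(G1, G2)
```

In this example the joined grammar has Yang-Kieffer length equal to the sum of the lengths of the two grammars, and the duplicated rule for `"ab"` is exactly what a subsequent reduction could remove.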



Recall also Grammar Reduction Rule 5 from [1], which deletes useless nonterminals from the grammar and, for all nonterminals sharing the same expansion, substitutes one of them. Let $\mathbf{I}G$ be the result of applying the rule to grammar $G$.

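Theorem 3: Let $\Gamma$ be a $(|\cdot|, \mathcal{J})$-minimal grammar transform. If $G_1, G_2 \in \mathcal{K} \implies \mathbf{I}G_1, G_1 \otimes G_2 \in \mathcal{K}$ for some grammar class $\mathcal{K}$ being $|\cdot|$-equivalent to $\mathcal{J}$ then

$$|\Gamma(u)| + |\Gamma(v)| - |\Gamma(uv)| \ge \mathbf{V}[\Gamma(u); \Gamma(v)]. \tag{14}$$

Remark: In particular, (14) holds for $\mathcal{J} = \mathcal{G}, \mathcal{P}, \mathcal{I}, \mathcal{F}, \mathcal{D}_k$.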

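Proof: Since $\mathcal{K}$ is closed under operation $\mathbf{I}$, there exist $G_1 \in \mathcal{K} \cap \mathcal{G}(u)$ and $G_2 \in \mathcal{K} \cap \mathcal{G}(v)$ such that $|G_1| = |\Gamma(u)|$, $|G_2| = |\Gamma(v)|$, and $\mathbf{I}G_i = G_i$. Hence $|\alpha_{ij}| > 1$ for $(\alpha_{i1}, \alpha_{i2}, ..., \alpha_{in_i}) = G_i$ and, consequently,

$$|\mathbf{I}(G_1 \otimes G_2)| \le |G_1 \otimes G_2| - \mathbf{V}[G_1; G_2] \min_{ij} |\alpha_{ij}| \le |G_1 \otimes G_2| - \mathbf{V}[G_1; G_2]. \tag{15}$$

Notice that $|G_1 \otimes G_2| = |G_1| + |G_2|$. Thus (14) follows from (15) and from $|\Gamma(w)| \le |\mathbf{I}(G_1 \otimes G_2)|$ for $w = uv$. ■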

The next proposition suggests that the size of common vocabulary $\mathbf{V}[\Gamma(u); \Gamma(v)]$ for irreducible grammar transforms may grow quite fast with the length of strings $u$ and $v$.

Theorem 4: (i) If $\Gamma$ is an $\mathcal{F} \cap \mathcal{P}$-grammar transform then

$$\mathbf{V}[\Gamma(w)]\, \mathbf{L}(w) > \sqrt{|\Gamma(w)|/2} - D - 1. \tag{16}$$

(ii) If $\Gamma$ is an $\mathcal{I}$-grammar transform then

$$\mathbf{V}[\Gamma(w)] > \sqrt{|\Gamma(w)|/2} - D - 1. \tag{17}$$

Remark: Bound (ii) was mentioned in [7].

Proof: Write $G = \Gamma(w)$ and $\mathbf{V} = \mathbf{V}[\Gamma(w)]$ for brevity. Notice that $x + a + 1 > \sqrt{y/2}$ follows from $(y - x)/2 \le (x + a)^2$ for $x, y, a \ge 0$.
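The implication just noted can be verified in one line. The following worked step is a sketch supplied for convenience; it uses only $x, y, a \ge 0$ and the elementary estimate $x/2 < 2(x + a) + 1$.

```latex
\frac{y}{2} \le (x+a)^2 + \frac{x}{2}
  < (x+a)^2 + 2(x+a) + 1 = (x+a+1)^2
\quad\Longrightarrow\quad
\sqrt{y/2} < x + a + 1 .
```

For instance, taking $y = |G|$, $a = D$, and $x$ equal to the vocabulary size of $G$ turns a bound of the form $(|G| - x)/2 \le (x + D)^2$ into $x > \sqrt{|G|/2} - D - 1$.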

(i) At every second position of the start symbol definition of $G$, a pair of symbols can occur only once. Thus (16) follows by $[|G| - \mathbf{V}\mathbf{L}(w)]/2 \le (\mathbf{V} + D)^2 \le (\mathbf{V}\mathbf{L}(w) + D)^2$.

(ii) In this case, any pair of symbols occurs at most once at every second position of all right-hand sides of $G$. Hence, $(|G| - \mathbf{V})/2 \le (\mathbf{V} + D)^2$, which implies (17).

■

IV. Conclusion 

We have shown that the vocabulary size of certain minimal universal grammar-based codes is greater than the excess code length divided by the length of the longest repeated substring $\mathbf{L}(\cdot)$. Recall that $\mathbf{L}(X_{1:n})$ cannot be upper-bounded almost surely by a universal function $o(n)$ for a block of $n$ symbols drawn from an arbitrary stationary stochastic process [14]. Nevertheless, $\mathbf{L}(X_{1:n}) = O(\log n)$ if $(X_k)_{k \in \mathbb{Z}}$ is a finite-energy process [15]. Hence, an extended Hilberg hypothesis [10], stating that a good model for texts in natural languages is a finite-energy process with excess entropy $E(n) \asymp \sqrt{n}$, seems consistent with observations asserting that the vocabulary size for certain text compressions is $\Omega(\sqrt{n}/\log n)$, where $n$ is the text length [16, Figure 3.12 (b), p. 69].

While some premises appealing to ergodic decomposition make Hilberg's hypothesis plausible even without the evidence of grammar-based compression [6], there remains an important theoretical problem. Can we use the vocabulary size or the excess length of a grammar-based code to estimate excess entropy accurately? Inequality (1) gives a lower bound for $E^C(n) - E(n)$ but the upper bounds are less recognized. Although $|E^C(n) - E(n)| = O(\log n)$ when the length of code $C$ equals prefix algorithmic complexity and block distribution $P(X_{1:n})$ is recursively computable [6], [4], some results in ergodic theory indicate that there is no universal bound for $|E^C(n) - E(n)|$ in the class of stationary processes [6], [17].

Simpler arguments could be used to infer that difference $E^C(n) - E(n)$ is large for certain codes and stochastic processes. Consider compressing a memoryless source with entropy rate $h > 0$. We have $E(n) = 0$. On the other hand, let code $C$ be formed by a local encoder satisfying (8) and an irreducible transform $\Gamma$. Then $E^C(n) = \Omega(\sqrt{hn}/\log n)$ would be implied by Theorems 3 and 4 if relation $\mathbf{V}[\Gamma(X_{1:n}); \Gamma(X_{n+1:2n})] \sim \mathbf{V}[\Gamma(X_{1:n})]$ held.

Let us notice that the bound for $E^C(n)$ conjectured for memoryless sources and irreducible grammar-based codes is almost the same as the inequality established for general minimal codes and sources with $E(n) \asymp \sqrt{n}$. This should not obscure the fact that there is a huge variation of vocabulary size for different information sources and a fixed code [7], an empirical fact not yet fully understood theoretically.